Entity Profile Extraction from Large Corpora
Abstract
Information Extraction (IE) has two anchor points: (i) entity-centric information leads to an Entity Profile (EP); (ii) action-centric information leads to an Event Scenario. Based on a pipelined architecture which involves both document-level IE and corpus-level IE, a multi-level modular approach to EP extraction from large corpora is described: (i) named entity tagging; (ii) three-level pattern matching for extracting the underlying correlated entity relationships; (iii) co-reference; (iv) document-internal merging of entity relationships into discourse EPs; and (v) cross-document fusion of EPs. The approach achieves around 90% precision and 50%-70% recall for major EP relationships. The significance of EP enhanced by cross-document fusion is demonstrated.
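Steps (iv) and (v) of the pipeline above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `merge_document` and `fuse_corpus`, the triple format, the sample data, and the choice to fuse profiles by exact match on a canonical entity name are all assumptions made for illustration. The abstract does not say how cross-document fusion decides that two profiles describe the same entity.

```python
from collections import defaultdict


def merge_document(relations, coref):
    """Merge relationship triples from one document into discourse-level profiles.

    relations: list of (mention, relation, value) triples (hypothetical format).
    coref: dict mapping a mention to its canonical entity name (hypothetical).
    """
    profiles = defaultdict(lambda: defaultdict(set))
    for mention, relation, value in relations:
        # Resolve the mention through co-reference; fall back to the mention itself.
        entity = coref.get(mention, mention)
        profiles[entity][relation].add(value)
    return profiles


def fuse_corpus(document_profiles):
    """Fuse per-document profiles into corpus-level profiles.

    Assumption: two profiles refer to the same entity when their
    canonical names are identical.
    """
    fused = defaultdict(lambda: defaultdict(set))
    for profiles in document_profiles:
        for entity, slots in profiles.items():
            for relation, values in slots.items():
                fused[entity][relation] |= values
    return fused


# Invented sample data, for illustration only.
doc1 = merge_document(
    [("John Smith", "affiliation", "Acme Corp"), ("he", "position", "CEO")],
    {"he": "John Smith"},
)
doc2 = merge_document(
    [("Mr. Smith", "age", "45")],
    {"Mr. Smith": "John Smith"},
)
fused = fuse_corpus([doc1, doc2])
```

In this sketch the fused profile for "John Smith" collects slots that no single document contains on its own, which is the kind of enrichment the abstract attributes to cross-document fusion.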
Similar resources
Exploratory Relation Extraction in Large Text Corpora
In this paper, we propose and demonstrate Exploratory Relation Extraction (ERE), a novel approach to identifying and extracting relations from large text corpora based on user-driven and data-guided incremental exploration. We draw upon ideas from the information seeking paradigm of Exploratory Search (ES) to enable an exploration process in which users begin with a vaguely defined information ...
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Distant supervision for relation extraction without labeled data
Modern models of relation extraction for tasks like ACE are based on supervised learning of relations from small hand-labeled corpora. We investigate an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size. Our experiments use Freebase, a large semantic database of several thousand relation...
A Resource-Based Method for Named Entity Extraction and Classification
We propose a resource-based Named Entity Classification (NEC) system, which combines named entity extraction with simple language-independent heuristics. Large lists (gazetteers) of named entities are automatically extracted making use of semi-structured information from the Wikipedia, namely infoboxes and category trees. Language-independent heuristics are used to disambiguate and classify enti...
Relation Extraction with Massive Seed and Large Corpora
The research area of information extraction (IE) aims to extract relevant structured information from natural language texts. In addition to the named-entity recognition (NER) task, the identification and classification of relations among entities, namely, the so-called relation extraction (RE) task, is particularly important for many real-world applications. Given the sentence in Figure 1, a R...
Publication date: 2003